Foreign Visitors in Brazil - 2005 to 2015 - Part I
Hi Folks, how are you doing?
Well, since it has been a very long time since I posted anything, I thought it would be nice to start unloading the hard drugs without mercy. I present to you a study of foreign visitors in Brazil between 2005 and 2015. It was divided into three parts because it was really long. Hope you enjoy!
This dataset was retrieved from a Brazilian government website (http://dados.gov.br/dataset/chegada-turistas , in Portuguese) and contains the number of foreign entries into Brazilian territory, divided by state. Data from 2005 to 2015 was merged from 10 different CSV files. There are 8 variables: country of origin, continent, state of arrival, access method, year, month as text, month as number, and number of visitors that month.
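As a rough illustration of the merge step, here is a minimal Python sketch that stacks several CSV files sharing one header. The file names and sample rows are made up, and the column names are simply the ones that appear in the summary output of this post; the real government files may use different headers and separators.

```python
import csv
import io

# Hypothetical stand-ins for two of the yearly files (made-up rows).
FILES = {
    "2005.csv": "Country,Continent,State,Access,Year,Month,Month.Number,Arrivals\n"
                "Argentina,América do Sul,Paraná,Terrestre,2005,janeiro,1,120\n",
    "2006.csv": "Country,Continent,State,Access,Year,Month,Month.Number,Arrivals\n"
                "Alemanha,Europa,Bahia,Aérea,2006,fevereiro,2,35\n",
}

def merge_csv(files):
    """Stack the rows of several CSV texts that share the same header."""
    rows = []
    for name in sorted(files):
        reader = csv.DictReader(io.StringIO(files[name]))
        rows.extend(reader)
    return rows

visitors = merge_csv(FILES)
print(len(visitors), "rows from", len(FILES), "files")
```

With real files on disk, `io.StringIO(files[name])` would be replaced by an opened file handle.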
Unfortunately, all names used are in Portuguese, which should be OK for country names but is not that intuitive for the access method: ‘aérea’ means ‘by air’, ‘fluvial’ means ‘by river’, ‘marítima’ means ‘by sea’ and ‘terrestre’ means ‘by land’ (by car, bus, on foot, etc.). Month names will be avoided, and month numbers will be used instead.
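If you prefer English labels, the translation above fits in a small lookup table. This is only a sketch: the dictionary and the helper name `translate_access` are my own, not part of the dataset.

```python
# Portuguese access-method labels and their English meaning.
ACCESS_EN = {
    "Aérea": "by air",
    "Fluvial": "by river",
    "Marítima": "by sea",
    "Terrestre": "by land",
}

def translate_access(label):
    """Return the English label, or the original text if it is unknown."""
    return ACCESS_EN.get(label, label)

print([translate_access(a) for a in ["Aérea", "Terrestre", "Outro"]])
```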
Before dealing with the data, I made some initial hypotheses: most visitors to internationally famous cities such as Rio de Janeiro will come from developed countries during Carnival (which usually happens in February), while big commercial cities such as Sao Paulo will have the largest number of foreign visitors, well distributed over the year.
Some internal codes were removed from this dataset; they are probably used by government databases and have no value to our study.
##                     Continent              Country
##  Europa                  :93468   Outros países: 22620
##  América do Sul          :54288   África do Sul:  4524
##  Ásia                    :28680   Alemanha     :  4524
##  África                  :22620   Angola       :  4524
##  América Central e Caribe:21852   Argentina    :  4524
##  América do Norte        :13572   Austrália    :  4524
##  (Other)                 :17328   (Other)      :206568
##                           State              Access            Year
##  Outras Unidades da Federação: 28704   Aérea    :102912   Min.   :2005
##  Rio Grande do Sul           : 28032   Fluvial  : 32736   1st Qu.:2007
##  Paraná                      : 24672   Marítima : 68736   Median :2010
##  Santa Catarina              : 22032   Terrestre: 47424   Mean   :2010
##  Amazonas                    : 14688                      3rd Qu.:2013
##  Bahia                       : 14688                      Max.   :2015
##  (Other)                     :118992
##        Month         Month.Number       Arrivals
##  abril    : 20984   Min.   : 1.00   Min.   :     0.0
##  agosto   : 20984   1st Qu.: 3.75   1st Qu.:     0.0
##  dezembro : 20984   Median : 6.50   Median :     0.0
##  fevereiro: 20984   Mean   : 6.50   Mean   :   240.4
##  janeiro  : 20984   3rd Qu.: 9.25   3rd Qu.:    11.0
##  julho    : 20984   Max.   :12.00   Max.   :353122.0
##  (Other)  :125904                   NA's   :1920
The present dataset has 251,808 entries in total, but the only numeric variables are the number of arrivals and month of the year.
Let’s start with the count of records whose number of arrivals is different from zero, per continent. For some reason, there is a category ‘Non specified Continent’ (‘Continente não especificado’ in the data), which makes no sense, so it will be removed. Please note the graph had to be flipped to better accommodate the bar labels.
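The counting step can be sketched in a few lines of Python. The records below are made up for illustration; only the label ‘Continente não especificado’ comes from the dataset.

```python
from collections import Counter

# Made-up (continent, arrivals) records, for illustration only.
continent_records = [
    ("Europa", 12), ("Europa", 0), ("Europa", 3),
    ("América do Sul", 40), ("Ásia", 0),
    ("Continente não especificado", 7),
]

UNSPECIFIED = "Continente não especificado"

# Count records (not visitors) with non-zero arrivals, per continent,
# skipping the unspecified category.
nonzero_per_continent = Counter(
    continent
    for continent, arrivals in continent_records
    if arrivals != 0 and continent != UNSPECIFIED
)
print(nonzero_per_continent.most_common())
```

Note that this counts rows, so a continent split into many countries collects more rows even if the visitor totals are similar.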
Looking at the graph, it seems the majority come from Europe, but since our data is also divided by country, and Europe has lots of small countries, this does not necessarily mean more visitors. The following graph is a look into those ‘Non specified Continent’ entries. It seems they arrived mostly in the states of Sao Paulo and Rio de Janeiro. We will look into it later, but this is the first mystery our dataset raised.
After checking which state those “non specified continent” entries headed to, there is another weird thing about this dataset: some states are merged into one group called “Other States” (‘Outras Unidades da Federação’). Let's check how many states we have in total:
There are 17 states out of the 26. The missing states are likely merged into this “Other States” category. Visually, this group is the 3rd largest in number of non-zero entries. So it makes absolutely no sense to keep it in the database; it should be split into the missing states. That is a good example of the Brazilian public services efficiency we are all proud of.
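Counting the states that are actually named boils down to taking the distinct values of the state column and setting the catch-all group aside. A minimal sketch, using a made-up sample of that column:

```python
# Made-up sample of the State column, for illustration only.
state_column = [
    "Paraná", "Bahia", "Outras Unidades da Federação",
    "Paraná", "Amazonas", "Outras Unidades da Federação",
]

CATCH_ALL = "Outras Unidades da Federação"  # the "Other States" group

# Distinct states, without the catch-all category.
named_states = sorted(set(state_column) - {CATCH_ALL})
print(len(named_states), "named states:", named_states)
```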
We can also check which access method has more entries.
By far the vast majority of entries are ‘by air’, which makes sense since Brazil is far from Europe and North America and is considered a continent-sized country. It would not be a surprise if the ‘by land’ entries come from the ‘Latin America’ entries in the continent graph.
Next there is the count of entries per year, removing entries with zero arrivals.
There are three recognizable peaks in the graph for 2006, 2008 and 2015.
As mentioned before, we would expect to see some peaks during summer in the southern hemisphere due to tourism. It is always good to remember that we are still seeing just the number of entries, not the actual number of arrivals.
The difference for the period from November to February can already be seen clearly in the count graph.
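The distinction between counting entries and summing arrivals is easy to miss, so here is a small Python sketch of both measures side by side. The numbers are made up for illustration.

```python
from collections import defaultdict

# Made-up (month_number, arrivals) records, for illustration only.
monthly_records = [(1, 500), (1, 0), (1, 20), (2, 3000), (7, 10), (7, 15)]

nonzero_records = defaultdict(int)  # how many rows have arrivals > 0
total_arrivals = defaultdict(int)   # how many visitors those rows add up to

for month, arrivals in monthly_records:
    if arrivals > 0:
        nonzero_records[month] += 1
    total_arrivals[month] += arrivals

for month in sorted(total_arrivals):
    print(month, nonzero_records[month], total_arrivals[month])
```

In this toy sample, month 2 has a single record but by far the most visitors, which is exactly why the count graphs should not be read as visitor totals.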
For our last analysis, we look at the number of arrivals, first overall, then for all values different from zero.
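For readers who want to reproduce this kind of summary, here is a minimal Python sketch that computes the same six statistics twice, once for all values and once for the non-zero ones. The sample values are made up, and the helper `summarize` is my own; the output right after it comes from the real data.

```python
import statistics

# Made-up arrivals values, for illustration only.
arrivals_sample = [0, 0, 0, 0, 2, 10, 50, 1000]

def summarize(values):
    """Min, quartiles, mean and max, using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": q2,
        "mean": statistics.mean(values), "q3": q3, "max": max(values),
    }

overall = summarize(arrivals_sample)
nonzero = summarize([v for v in arrivals_sample if v != 0])
print(overall)
print(nonzero)
```

Even in this tiny sample a single large value drags the mean far above the median.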
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##      0.0      0.0      0.0    240.4     11.0 353122.0     1920
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1       3      14     509     114  353122
With a mean of 240 and a max of 353,122, it seems likely that some very large values are pulling the mean up, especially because the 3rd quartile is only 11. However, there are 1,920 NA values where there should be none. After further investigation, it seems the file for 2007 had this problem, as shown below.
##                     Continent              Country
##  América Central e Caribe:768   Cuba            :384
##  Europa                  :768   Guatemala       :384
##  Ásia                    :384   Índia           :384
##  África                  :  0   República Tcheca:384
##  América do Norte        :  0   Rússia          :384
##  América do Sul          :  0   África do Sul   :  0
##  (Other)                 :  0   (Other)         :  0
##                           State          Access          Year
##  Outras Unidades da Federação:240   Aérea    :720   Min.   :2007
##  Paraná                      :240   Fluvial  :240   1st Qu.:2007
##  Rio Grande do Sul           :240   Marítima :600   Median :2007
##  Santa Catarina              :180   Terrestre:360   Mean   :2007
##  Amazonas                    :120                   3rd Qu.:2007
##  Bahia                       :120                   Max.   :2007
##  (Other)                     :780
##        Month      Month.Number    Arrivals
##  abril    :160   1      :160   Min.   : NA
##  agosto   :160   2      :160   1st Qu.: NA
##  dezembro :160   3      :160   Median : NA
##  fevereiro:160   4      :160   Mean   :NaN
##  janeiro  :160   5      :160   3rd Qu.: NA
##  julho    :160   6      :160   Max.   : NA
##  (Other)  :960   (Other):960   NA's   :1920
In order to work with a more reasonable dataset, we will treat all those NA values as zeroes rather than as some error from the government system.
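The fix itself is a one-liner in most tools. As a hedged Python sketch, assuming the missing values show up as empty strings or the text "NA" when the CSV is read (the sample values and the helper `to_arrivals` are made up):

```python
# Made-up Arrivals fields as read from a CSV; "" and "NA" stand in for
# the missing values found in the 2007 file.
raw_arrivals = ["12", "NA", "", "0", "353122"]

def to_arrivals(value):
    """Parse an arrivals field, treating missing values as zero."""
    value = value.strip()
    if value in ("", "NA"):
        return 0
    return int(value)

missing_before = sum(v.strip() in ("", "NA") for v in raw_arrivals)
arrivals = [to_arrivals(v) for v in raw_arrivals]
print(missing_before, arrivals)
```

Keep in mind that this choice lowers the mean a little, since rows that may have had visitors now count as zero.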
After performing this fix, the number of NAs went to zero, as shown below.
##                        Continent           Country
##  África                     :0   África do Sul :0
##  América Central e Caribe   :0   Alemanha      :0
##  América do Norte           :0   Angola        :0
##  América do Sul             :0   Arábia Saudita:0
##  Ásia                       :0   Argentina     :0
##  Continente não especificado:0   Austrália     :0
##  (Other)                    :0   (Other)       :0
##                           State        Access        Year
##  Amazonas                    :0   Aérea    :0   Min.   : NA
##  Bahia                       :0   Fluvial  :0   1st Qu.: NA
##  Ceará                       :0   Marítima :0   Median : NA
##  Mato Grosso do Sul          :0   Terrestre:0   Mean   :NaN
##  Outras Unidades da Federação:0                 3rd Qu.: NA
##  Pará                        :0                 Max.   : NA
##  (Other)                     :0
##        Month    Month.Number    Arrivals
##  abril    :0   1      :0   Min.   : NA
##  agosto   :0   2      :0   1st Qu.: NA
##  dezembro :0   3      :0   Median : NA
##  fevereiro:0   4      :0   Mean   :NaN
##  janeiro  :0   5      :0   3rd Qu.: NA
##  julho    :0   6      :0   Max.   : NA
##  (Other)  :0   (Other):0
So basically what we did so far was to make some fixes to the data and analyse the number of entries and the data in general. There is still a lot of analysis to be done, and hopefully we will accomplish that in the next posts. ;)