Foreign Visitors in Brazil - 2005 to 2015 - Part I

Hi Folks, how are you doing?

Well, since it has been a very long time since I last posted anything, I thought it would be nice to start unloading the hard drugs without mercy. I present to you a study of foreign visitors in Brazil between the years 2005 and 2015. It was divided into three parts because it turned out really long. Hope you enjoy!

This dataset was retrieved from a Brazilian government website (in Portuguese: http://dados.gov.br/dataset/chegada-turistas) and contains the number of foreign entries into Brazilian territory, broken down by state. Data from 2005 to 2015 was merged from 10 different CSV files. There are 8 variables: country of origin, continent, state of arrival, access method, year, month as text, month as number, and number of visitors in that month.
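To give an idea of what the merge step looks like, here is a minimal sketch in Python using only the standard library (the summaries in this post come from R, so this is an illustration and not the code actually used). The function name `merge_csv_sources`, the comma delimiter and the toy rows are all assumptions; the real files may use a different separator or encoding.

```python
import csv
import io


def merge_csv_sources(sources, delimiter=","):
    """Concatenate the rows of several CSV sources that share one header."""
    merged = []
    header = None
    for source in sources:
        reader = csv.DictReader(source, delimiter=delimiter)
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            # Refuse to merge files whose columns do not line up.
            raise ValueError("CSV files do not share the same columns")
        merged.extend(reader)
    return merged


# Toy stand-ins for two yearly files (made-up rows, not the real data).
file_2005 = io.StringIO("Country,Year,Arrivals\nArgentina,2005,10\n")
file_2006 = io.StringIO("Country,Year,Arrivals\nArgentina,2006,12\nAngola,2006,0\n")

rows = merge_csv_sources([file_2005, file_2006])
print(len(rows), "rows after merging")
```

With real files, each `io.StringIO` would be replaced by an opened file handle, one per yearly CSV.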

Unfortunately, all names used are in Portuguese. That should be OK for country names, but it is not that intuitive for the access method: ‘aérea’ means ‘by air’, ‘fluvial’ means ‘by river’, ‘marítima’ means ‘by sea’ and ‘terrestre’ means ‘by land’ (by car, by bus, on foot, etc.). Month names will be avoided, and month numbers will be used instead.

Before dealing with the data, I made two hypotheses. First, the majority of visitors to internationally famous cities such as Rio de Janeiro will come from developed countries during Carnival (which usually happens in February). Second, big commercial cities such as Sao Paulo will have the largest number of foreign visitors, and those visitors will be well distributed over the year.

Some internal codes were removed from this dataset. They are probably used by government databases and have no real value for our study.



##                     Continent              Country      
##  Europa                  :93468   Outros países: 22620  
##  América do Sul          :54288   África do Sul:  4524  
##  Ásia                    :28680   Alemanha     :  4524  
##  África                  :22620   Angola       :  4524  
##  América Central e Caribe:21852   Argentina    :  4524  
##  América do Norte        :13572   Austrália    :  4524  
##  (Other)                 :17328   (Other)      :206568  
##                           State              Access            Year     
##  Outras Unidades da Federação: 28704   Aérea    :102912   Min.   :2005  
##  Rio Grande do Sul           : 28032   Fluvial  : 32736   1st Qu.:2007  
##  Paraná                      : 24672   Marítima : 68736   Median :2010  
##  Santa Catarina              : 22032   Terrestre: 47424   Mean   :2010  
##  Amazonas                    : 14688                      3rd Qu.:2013  
##  Bahia                       : 14688                      Max.   :2015  
##  (Other)                     :118992                                    
##        Month         Month.Number      Arrivals       
##  abril    : 20984   Min.   : 1.00   Min.   :     0.0  
##  agosto   : 20984   1st Qu.: 3.75   1st Qu.:     0.0  
##  dezembro : 20984   Median : 6.50   Median :     0.0  
##  fevereiro: 20984   Mean   : 6.50   Mean   :   240.4  
##  janeiro  : 20984   3rd Qu.: 9.25   3rd Qu.:    11.0  
##  julho    : 20984   Max.   :12.00   Max.   :353122.0  
##  (Other)  :125904                   NA's   :1920

The present dataset has 251,808 entries in total, but the only numeric variables are the year, the month number and the number of arrivals.

Let’s start with the count of records per continent whose number of arrivals is different from zero. For some reason, there is a category ‘Non specified Continent’ (‘Continente não especificado’), which makes no sense, so it will be removed. Please note the graph had to be flipped to better accommodate the labels of the bars.
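The counting itself is simple. Below is a rough sketch in Python (standard library only, not the original R code) of counting non-zero records per group while leaving out a category. The helper name `count_nonzero_records` and the toy rows are made up for illustration; the column names follow the summary above.

```python
from collections import Counter

UNSPECIFIED = "Continente não especificado"


def count_nonzero_records(rows, key, exclude=()):
    """Count records per value of `key`, skipping zero or missing arrivals."""
    counts = Counter()
    for row in rows:
        arrivals = row["Arrivals"]
        if arrivals is None or arrivals == 0:
            continue  # only records with at least one arrival are counted
        if row[key] in exclude:
            continue  # drop categories we decided to remove
        counts[row[key]] += 1
    return counts


# Toy rows (made up, not the real data).
rows = [
    {"Continent": "Europa", "Arrivals": 5},
    {"Continent": "Europa", "Arrivals": 0},
    {"Continent": "América do Sul", "Arrivals": 120},
    {"Continent": UNSPECIFIED, "Arrivals": 7},
    {"Continent": "Ásia", "Arrivals": None},
]

counts = count_nonzero_records(rows, "Continent", exclude={UNSPECIFIED})
for name, n in counts.most_common():
    print(name, n)
```

The same helper could be pointed at `Access`, `Year` or `Month.Number` to get the other counts discussed in this post, assuming each row carries those keys.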






Looking at the graph, it seems the majority come from Europe. However, since our data is also divided by country, and Europe has lots of small countries, more records do not necessarily mean more visitors. The following graph is a look into those ‘Non specified Continent’ entries. It seems they arrived mostly in the states of Sao Paulo and Rio de Janeiro. We will look into it later, but this is the first mystery our dataset raised.



After checking which state those “non specified continent” entries headed to, there is another weird thing about this dataset: some states are merged into one group called “Other States” (“Outras Unidades da Federação”). Let’s check how many states we have in total:



There are 17 states out of the 26. The missing states are likely merged into this “other states” category. Visually, this group is the 3rd largest in number of non-zero entries. So it makes absolutely no sense to keep it in the database like this; it should be split into the missing states. That is a good example of the efficiency of Brazilian public services we are all so proud of.
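As a small illustration of that check, here is a sketch in Python (standard library only, not the original R code) that lists the distinct state names while setting the catch-all group aside. The helper name `named_states` and the toy rows are hypothetical.

```python
OTHER_STATES = "Outras Unidades da Federação"


def named_states(rows):
    """Return the sorted distinct state names, leaving out the catch-all group."""
    return sorted({row["State"] for row in rows} - {OTHER_STATES})


# Toy rows (made up, not the real data).
rows = [
    {"State": "Paraná"},
    {"State": "Bahia"},
    {"State": OTHER_STATES},
    {"State": "Paraná"},
]

states = named_states(rows)
print(len(states), "named states:", states)
```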


We can also check which access method has more entries.




By far the vast majority of entries are ‘by air’. This makes sense, since Brazil is far from Europe and North America and is itself a continental-sized country. It would not be a surprise if most of the ‘by land’ entries matched the Latin American entries from the continent graph.


Next there is the count of entries per year, removing entries with zero arrivals.





There are three recognizable peaks in the graph for 2006, 2008 and 2015.

As mentioned before, we would expect to see some peaks during summer in the southern hemisphere due to tourism. It is always good to remember that we are still looking only at the number of records, not at the actual number of arrivals.




The difference for the period from November to February is already clearly visible in the count graph.

For our last analysis, we look at the number of arrivals, first overall, then for all values different from zero.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      0.0      0.0    240.4     11.0 353122.0     1920
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       3      14     509     114  353122
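For readers who want to reproduce this kind of summary, here is a minimal sketch in Python (standard library only, not the original R code). The function name `five_number_summary` and the toy values are assumptions. The `method="inclusive"` option of `statistics.quantiles` is used because, as far as I know, it matches the default quantile definition in R.

```python
import statistics


def five_number_summary(values):
    """Summarise a list of counts, ignoring missing (None) entries."""
    clean = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    return {
        "min": min(clean),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(clean),
        "q3": q3,
        "max": max(clean),
        "missing": len(values) - len(clean),
    }


# Toy arrivals (made up): mostly zeros, one missing value, one huge value.
arrivals = [0, 0, 0, 4, 10, None, 1000]

overall = five_number_summary(arrivals)
nonzero = five_number_summary([v for v in arrivals if v])

print(overall)
print(nonzero)
```

Even in this tiny example the mean ends up far above the 3rd quartile, which is the same pattern seen in the real data.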

With a mean of 240 and a max of 353122, it seems likely that some very large values are pulling the mean up, especially because the 3rd quartile is only 11. However, there are 1920 NA values where there should be none. After further investigation, it seems only the file for the year 2007 had this problem, as shown below.


##                     Continent               Country   
##  América Central e Caribe:768   Cuba            :384  
##  Europa                  :768   Guatemala       :384  
##  Ásia                    :384   Índia           :384  
##  África                  :  0   República Tcheca:384  
##  América do Norte        :  0   Rússia          :384  
##  América do Sul          :  0   África do Sul   :  0  
##  (Other)                 :  0   (Other)         :  0  
##                           State           Access         Year     
##  Outras Unidades da Federação:240   Aérea    :720   Min.   :2007  
##  Paraná                      :240   Fluvial  :240   1st Qu.:2007  
##  Rio Grande do Sul           :240   Marítima :600   Median :2007  
##  Santa Catarina              :180   Terrestre:360   Mean   :2007  
##  Amazonas                    :120                   3rd Qu.:2007  
##  Bahia                       :120                   Max.   :2007  
##  (Other)                     :780                                 
##        Month      Month.Number    Arrivals   
##  abril    :160   1      :160   Min.   : NA   
##  agosto   :160   2      :160   1st Qu.: NA   
##  dezembro :160   3      :160   Median : NA   
##  fevereiro:160   4      :160   Mean   :NaN   
##  janeiro  :160   5      :160   3rd Qu.: NA   
##  julho    :160   6      :160   Max.   : NA   
##  (Other)  :960   (Other):960   NA's   :1920
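The investigation boils down to counting the missing values per year. Here is a short sketch in Python (standard library only, not the original R code), assuming the rows were parsed so that a missing arrival is stored as `None`. The helper name `missing_by_year` and the toy rows are made up.

```python
from collections import Counter


def missing_by_year(rows):
    """Count, per year, the records whose arrivals value is missing."""
    return Counter(row["Year"] for row in rows if row["Arrivals"] is None)


# Toy rows (made up, not the real data).
rows = [
    {"Year": 2006, "Arrivals": 3},
    {"Year": 2007, "Arrivals": None},
    {"Year": 2007, "Arrivals": None},
    {"Year": 2008, "Arrivals": 0},
]

missing = missing_by_year(rows)
print(dict(missing))
```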


In order to work with a more reasonable dataset, we will treat all those missing values as zeroes, rather than as some error from the government system.
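A minimal sketch of this replacement in Python (standard library only, not the original R code), again assuming that missing arrivals are stored as `None`. The helper name `fill_missing_arrivals` and the toy rows are hypothetical.

```python
def fill_missing_arrivals(rows, fill_value=0):
    """Return copies of the rows with missing arrivals replaced by fill_value."""
    return [
        {**row, "Arrivals": fill_value if row["Arrivals"] is None else row["Arrivals"]}
        for row in rows
    ]


# Toy rows (made up, not the real data).
rows = [
    {"Country": "Cuba", "Year": 2007, "Arrivals": None},
    {"Country": "Argentina", "Year": 2007, "Arrivals": 42},
]

fixed = fill_missing_arrivals(rows)
remaining = sum(1 for row in fixed if row["Arrivals"] is None)
print("missing values left:", remaining)
```

The function builds new dictionaries instead of changing the input, so the original rows stay available for comparison.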

After performing this fix, the number of NAs went to zero, as shown below.


##                        Continent           Country 
##  África                     :0   África do Sul :0  
##  América Central e Caribe   :0   Alemanha      :0  
##  América do Norte           :0   Angola        :0  
##  América do Sul             :0   Arábia Saudita:0  
##  Ásia                       :0   Argentina     :0  
##  Continente não especificado:0   Austrália     :0  
##  (Other)                    :0   (Other)       :0  
##                           State         Access       Year    
##  Amazonas                    :0   Aérea    :0   Min.   : NA  
##  Bahia                       :0   Fluvial  :0   1st Qu.: NA  
##  Ceará                       :0   Marítima :0   Median : NA  
##  Mato Grosso do Sul          :0   Terrestre:0   Mean   :NaN  
##  Outras Unidades da Federação:0                 3rd Qu.: NA  
##  Pará                        :0                 Max.   : NA  
##  (Other)                     :0                              
##        Month    Month.Number    Arrivals  
##  abril    :0   1      :0     Min.   : NA  
##  agosto   :0   2      :0     1st Qu.: NA  
##  dezembro :0   3      :0     Median : NA  
##  fevereiro:0   4      :0     Mean   :NaN  
##  janeiro  :0   5      :0     3rd Qu.: NA  
##  julho    :0   6      :0     Max.   : NA  
##  (Other)  :0   (Other):0
So basically, what we did so far was to fix a few problems in the data and to analyse the number of records and the data in general. There is still a lot of analysis to be done, and hopefully we will accomplish that in the next posts. ;)
