I recently published a post titled, “4 Key Tweeting Attributes of Guy Kawasaki in one Infographic.” I made extensive use of SAS to gather and manipulate the data from Twitter. Turns out, SAS is pretty awesome for this type of work. In this post I’m going to document how to use SAS to gather data from Twitter’s API. My next post on SAS and Twitter will build off of this one and teach you how to gather data about your subject’s followers, find ReTweets, and listen in on conversations. Click here to get that post delivered to your inbox as soon as it’s published.
First off, you might wonder, why do this? Well, successful analysts of the future will be adept at analyzing all sorts of data, including data from social networks like Twitter. Also, if you’re looking to market your analytical skills, what hiring manager wouldn’t be impressed with someone who gathered data from Twitter’s API with SAS, then mined, analyzed, and presented the data in a compelling way? Oh, almost forgot: because you’re analyzing a current event (it’s on Twitter, right?) and mentioning Twitter in your post, your analysis will be more search engine friendly, so you’ll likely get a wider and more targeted audience than if you analyzed something outside of the Twitterverse. Some smart analysts have even been known to analyze Tweets about their target employer and use the analysis to help get themselves hired. On a larger scale, this is almost exactly what Seth Godin has done with Brands in Public.
Before we get started I have to tell you a little about Twitter’s rate limiting policy. Unfortunately, the search area of Twitter’s API doesn’t have a published rate limit. Rather, Twitter says they allow a rate quite a bit higher than their standard 150 hits/hour, but they decline to say how much. Full documentation can be found here, about halfway down the page. I have run afoul of the limit before, and my guess is that it’s around 600 hits per hour, or more than 30 in any one minute. When you exceed the unpublished rate, you have to wait between 1 and 3 hours before your IP address is allowed to hit Twitter again. If you’re just searching for someone’s posts, like we’re doing with Guy Kawasaki, you needn’t worry about getting anywhere near Twitter’s rate limit.
Ok, so now let’s get started.
Step 1:
After you figure out what you want to search for (this site is a good start to find trends, and they graph them out for you), you’ll need to plug your search term into the URL string that your SAS program will use. If you’re searching for a person, like I did, your string will look like this:
http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100
The ‘q=’ parameter carries the search query, and the ‘from’ operator tells Twitter that you’re searching for Tweets from a specific user. The ‘%3A’ is URL encoding for a ‘:’. And the ‘&rpp’ parameter tells Twitter how many results to return per page; 100 is the maximum. You can copy and paste that string into your browser right now and get back some nicely formatted XML representing Guy’s last 100 Tweets.
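If you’d rather not hand-encode the search term, here is a minimal sketch that builds the same string inside a DATA step with SAS’s URLENCODE function. The variable names query and url are my own illustration and are not part of the code linked in this post.

```sas
/* Build the search URL from a plain-text query.                   */
/* QUERY and URL are hypothetical variable names for illustration. */
data _null_;
  length query $ 200 url $ 400;
  query = 'from:guykawasaki';                  /* search operator in plain text     */
  url = cats('http://search.twitter.com/search.atom?q=',
             urlencode(strip(query)),          /* encodes the colon as %3A          */
             '&rpp=100');                      /* single quotes keep & from being   */
                                               /* read as a macro variable trigger  */
  put url=;                                    /* write the finished URL to the log */
run;
```

The strip() call matters: without it, the trailing blanks that pad the query variable would be encoded along with the search term.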
Step 2:
OK, you know what you’re searching for and how to format the URL string to get your results. But Twitter returns a paltry 100 results at a time. You’re a SAS user; you don’t work with 100-record data sets! You want more, so you wrap your code in a macro, increment Twitter’s page= parameter to get older results, and append each new page to your master dataset. Twitter will generally allow you to pull down about one week’s worth of search results. The code to do this is located here.
That’s enough to get you started. You now have a SAS data set with lots of Twitter data, including text to mine, dates and times to trend out, and, hopefully, an interesting topic to help showcase your analytical prowess to your audience.
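The paging logic described above can be sketched roughly as follows. This is an illustrative rewrite, not the code behind the link: the macro name getpages, the maxpages cap, and the data set names page and tweets are assumptions, and it expects a fileref named twitfile that already points at an XML map defining an entry table.

```sas
%macro getpages(maxpages=15);
  %local base pageno lastcnt;

  /* %nrstr stops the & and % in the URL from being read as macro triggers */
  %let base=%nrstr(http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100&page=);
  %let pageno=1;
  %let lastcnt=1;

  /* start from a clean master data set (TWEETS is a hypothetical name) */
  proc datasets lib=work nolist nowarn;
    delete tweets;
  quit;

  /* keep requesting pages until one comes back empty or the cap is reached */
  %do %while(&lastcnt > 0 and &pageno <= &maxpages);
    filename twit url "&base&pageno";
    libname tf xml xmlfileref=twit xmlmap=twitfile;

    /* default to 0: it only changes if at least one row is actually read */
    %let lastcnt=0;
    data page;
      set tf.entry end=last;
      if last then call symputx('lastcnt', _n_);   /* rows on this page */
    run;

    /* append only pages that returned rows */
    %if &lastcnt > 0 %then %do;
      proc append base=tweets data=page force;
      run;
    %end;

    libname tf clear;
    filename twit clear;
    %let pageno=%eval(&pageno + 1);
  %end;
%mend getpages;

%getpages(maxpages=15);
```

Setting lastcnt to 0 before each read means an empty page or a failed read both end the loop, and the maxpages cap keeps the number of requests well under the rate limit discussed earlier.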
You can access the full code here.
Don’t forget to come back in about 2 weeks to read my post on how to wrangle and append other data from Twitter to your search dataset. Or, better yet, click here and get all of my posts in your inbox as soon as they’re published.
Hi there,
Great article. However, trying your code at this link http://bit.ly/b2ppJx returns errors. I’m running SAS 9.2 and get the errors below.
Do I need to have created an XML map first or something?
Hope you can help.
Cheers
100 /*dataset to hold the Tweet content*/
101 data guy;
102 length deleteme $ 1;
103 deleteme='Y';
104 run;
NOTE: Compression was disabled for data set WORK.GUY because compression overhead would increase the
size of the data set.
NOTE: The data set WORK.GUY has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
105 options mprint merror;
106 %macro getpage();
107 %let pageno=1;
108 %loopme:
109 %let fullfeed=%nrstr(http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100&page=);
110 /**enter your search term
110! above**/
111
112 filename twit URL "&fullfeed&pageno";
113 libname tf XML xmlfileref=twit xmlmap=twitfile;
114
115 data guy;
116 set tf.entry guy;
117 if deleteme='Y' then delete;
118 *get rid of the bugus record;
119 run;
120
121 data countme;
122 set tf.entry;
123 run;
124
125 proc contents data=countme noprint out=count;
126 run;
127
128 data _null_;
129 set count;
130 call symput('lastcnt',nobs);
131 run;
132 %let pageno=%eval(&pageno+1);
133 %put &lastcnt;
134 /**need to check how many records come back in the XML file from Twitter. When there’s 0 records,
134! Twitter has gone back as far as it can and it won’t return any more data from
135 the search. Typically, Twitter will release one week’s worth of search data**/
136 %if %eval(&lastcnt) =0 %then %goto endme; %else %goto loopme;
137 %endme:
138 run;
139 %mend;
140 %getpage();
MPRINT(GETPAGE): filename twit URL
"http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100&page=1";
MPRINT(GETPAGE): libname tf XML xmlfileref=twit xmlmap=twitfile;
NOTE: Libref TF was successfully assigned as follows:
Engine: XML
Physical Name:
MPRINT(GETPAGE): data guy;
MPRINT(GETPAGE): set tf.entry guy;
MPRINT(GETPAGE): * FROM ENTRY IN FILE ”;
ERROR: Physical file does not exist, C:\Documents and Settings\q743047\twitfile.
encountered during XMLMap parsing
occurred at or near line 1, column 1
ERROR: XML describe error: Internal processing error.
MPRINT(GETPAGE): if deleteme='Y' then delete;
MPRINT(GETPAGE): *get rid of the bugus record;
MPRINT(GETPAGE): run;
NOTE: Compression was disabled for data set WORK.GUY because compression overhead would increase the
size of the data set.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.GUY may be incomplete. When this step was stopped there were 0
observations and 1 variables.
WARNING: Data set WORK.GUY was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
MPRINT(GETPAGE): data countme;
MPRINT(GETPAGE): set tf.entry;
MPRINT(GETPAGE): * FROM ENTRY IN FILE ”;
ERROR: Physical file does not exist, C:\Documents and Settings\q743047\twitfile.
encountered during XMLMap parsing
occurred at or near line 1, column 1
ERROR: XML describe error: Internal processing error.
MPRINT(GETPAGE): run;
NOTE: Compression was disabled for data set WORK.COUNTME because compression overhead would increase
the size of the data set.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.COUNTME may be incomplete. When this step was stopped there were 0
observations and 0 variables.
WARNING: Data set WORK.COUNTME was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
MPRINT(GETPAGE): proc contents data=countme noprint out=count;
MPRINT(GETPAGE): run;
NOTE: The data set WORK.COUNT has 0 observations and 40 variables.
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
MPRINT(GETPAGE): data _null_;
MPRINT(GETPAGE): set count;
MPRINT(GETPAGE): call symput('lastcnt',nobs);
MPRINT(GETPAGE): run;
NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
249:148
NOTE: There were 0 observations read from the data set WORK.COUNT.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
WARNING: Apparent symbolic reference LASTCNT not resolved.
&lastcnt
WARNING: Apparent symbolic reference LASTCNT not resolved.
WARNING: Apparent symbolic reference LASTCNT not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand
is required. The condition was: &lastcnt
ERROR: The macro GETPAGE will stop executing.
@colm
The code you ran is just a section of the overall code that I published at http://bit.ly/9fOsLl. The code at that link includes the XML map, the one you thought you’d need. So you’re correct: SAS needs that XML map to show it how to parse the XML file that the URL string returns. Without it, SAS treats twitfile as a file path, which is why your log says the physical file does not exist. You should be able to run the full code, verbatim, and get back his last 1,000 tweets (approx.), including repeats, from Guy Kawasaki’s Twitter feed.
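For a sense of what that involves, here is a minimal sketch of an XML map written out from SAS and then handed to the libname statement. It is not the map from the full code: the column choices, the lengths, and the Atom element paths (/feed/entry/...) are assumptions for illustration only.

```sas
/* Write a small XMLMap to a temporary file.                        */
/* Hypothetical map: columns, lengths, and paths are assumptions.   */
filename twitfile temp;

data _null_;
  file twitfile;
  put '<?xml version="1.0" encoding="UTF-8"?>';
  put '<SXLEMAP version="1.2" name="twitter">';
  put '  <TABLE name="entry">';
  put '    <TABLE-PATH syntax="XPath">/feed/entry</TABLE-PATH>';
  put '    <COLUMN name="published">';
  put '      <PATH syntax="XPath">/feed/entry/published</PATH>';
  put '      <TYPE>character</TYPE>';
  put '      <DATATYPE>string</DATATYPE>';
  put '      <LENGTH>30</LENGTH>';
  put '    </COLUMN>';
  put '    <COLUMN name="title">';
  put '      <PATH syntax="XPath">/feed/entry/title</PATH>';
  put '      <TYPE>character</TYPE>';
  put '      <DATATYPE>string</DATATYPE>';
  put '      <LENGTH>200</LENGTH>';
  put '    </COLUMN>';
  put '  </TABLE>';
  put '</SXLEMAP>';
run;

/* Point the XML engine at the feed and at the map written above */
filename twit url 'http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100';
libname tf xml xmlfileref=twit xmlmap=twitfile access=readonly;

/* Each <entry> element in the feed becomes one row */
data guy;
  set tf.entry;
run;
```

The key point is that twitfile must exist as a fileref (or be given as a real path) before the libname statement runs.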
If you’re interested in going back a lot further, come back next week, or sign up for my newsletter here and learn how to do that and a few other tricks.
Thanks,
John C. Munoz
Hey John,
This is really cool stuff. I am a student, and I find the data collected from Twitter useful for many types of analysis. Can we do the same kind of thing to fetch Facebook status updates and their comments? Is that feasible? I was browsing the Facebook status update APIs and I don’t have any clue how to use them. Please share your thoughts.
Thanks
Satish
Satish,
I don’t think Facebook status updates are accessible unless you’re friends with the person whose updates you’re trying to access. So that makes FB not nearly as much fun to play with as Twitter.
Thanks for your comment. I hope to be posting a new SAS/Twitter blog entry shortly, so stay tuned!
Hi John,
Thanks for coming back to me. I just spotted that this morning as well. I have tried the full code but am getting the error below.
Not sure if this has anything to do with recent changes to the Twitter API? Any ideas?
Cheers
Colm
NOTE: Unable to connect to host search.twitter.com. Check validity of host name.
ERROR: Hostname search.twitter.com not found.
encountered during XMLInput parsing
Hi Colm,
I ran the code verbatim and didn’t receive any errors. However, I also didn’t get back any records, which I know isn’t right because Guy Kawasaki is still out there tweeting away. I tested the URL string and searched for other users, and did get back results. So that’s pretty weird. Perhaps there’s something new in the API which is limiting search results for specific users, or for users who tweet a lot. Who knows? In any case, try changing guykawasaki to johncmunoz and see what happens. You should get back at least 1 record in your dataset.
Hey John,
Tried changing the user etc but still get the following error
ERROR: Hostname search.twitter.com not found.
encountered during XMLInput parsing
occurred at or near line 1, column 1
Yet when I use the host name in the browser address bar it’s fine and returns the XML feed.
Are there any other ways to just get my tweets into a dataset?
Cheers
Colm
Hi Colm,
I’m sorry to hear about the trouble you’re having. I ran the full code verbatim again, and it worked without any errors. Perhaps the problem is with SAS. What version are you running? This is a wild guess, but perhaps your version doesn’t support the XML libname functionality that my code utilizes. I think SAS came out with the XML libname engine in 9.1.
If I were you I’d give SAS’ support a call. They can almost always get to the bottom of something like this. Their U.S. number is 919-677-8008. Provide them with your site number and tell them you need help with the xml libname statement.
There are a couple of other ways to access Twitter with SAS. Google proc soap and proc http. Also, there’s a good SAS blog post at http://blogs.sas.com/supportnews/index.php?/archives/72-Using-SAS-to-call-Twitter.html that goes into detail on proc http.
Good luck! Let us know how it turns out.
Hi John,
I tried your code but encountered the following error during the data steps (e.g. data countme; set tf.entry; run;). Do you happen to know why? I am using server SAS 9.1.3. Thank you!
– Patrick
ERROR: Unable to init socket API.
ERROR: Invalid open mode.
encountered during XMLInput parsing
occurred at or near line 1, column 1
Hi Patrick,
Sorry for the delayed response. Were you able to get the code working? Let me know if you weren’t and I’ll see what we can do.
John, Hi
Just found your post and tried your code, it works great.
When I first tried it, I changed the search term and made the mistake of erasing part of the code, as shown here:
Right:
from%3Aguykawasaki
Wrong
from%3guykawasaki
Great post, thank you
Thanks for this code. I am new to using XML code though. Can you explain what I need to do to get the rest of a Twitter user’s tweets? In other words, how can I go back more than a week? Also, how would I search for a subject on Twitter (if I don’t want to look up tweets from one user)?
Thanks again!
Hi Kristen,
From what I’ve read and know about Twitter’s API, there is no way to go back more than 1 week or x number of Tweets using Twitter’s own API. However, I think there are some sites on the web that will let you hop in a time machine and exceed Twitter’s thresholds. One is snapbird.org
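To search for a specific subject rather than a user, put the term after ‘q=’. Say you want to find recent tweets that mention @Eagles (the ‘%40’ is URL encoding for ‘@’; leave it off to search for the plain word ‘Eagles’). This version asks for JSON; swap in search.atom to get the XML that the SAS code in this post reads. The string would look like this: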
http://search.twitter.com/search.json?q=%40Eagles
Details about the search api are at: https://dev.twitter.com/docs/using-search
Good luck!