SAS and Twitter–how to harness SAS to grab data from Twitter in 2 easy steps

I recently published a post titled “4 Key Tweeting Attributes of Guy Kawasaki in one Infographic.” I made extensive use of SAS to gather and manipulate the data from Twitter. Turns out, SAS is pretty awesome for this type of work. In this post I’m going to document how to use SAS to gather data from Twitter’s API. My next post on SAS and Twitter will build on this one and teach you how to gather data about your subject’s followers, find retweets, and listen in on conversations. Click here to get that post delivered to your inbox as soon as it’s published.

First off, you might wonder, why do this? Well, successful analysts of the future will be adept at analyzing all sorts of data, including data from social networks like Twitter. Also, if you’re looking to market your analytical skills, what hiring manager wouldn’t be impressed with someone who gathered data from Twitter’s API with SAS, then mined, analyzed, and presented the data in a compelling way? Oh, and I almost forgot: because you’re analyzing a current event (it’s on Twitter, right?) and mentioning Twitter in your post, your analysis will be more search engine friendly, so you’ll likely get a wider and more targeted audience than if you analyzed something outside of the Twitterverse. Some smart analysts have even been known to analyze Tweets about their target employer and use the analysis to help get themselves hired. On a larger scale, this is almost exactly what Seth Godin has done with Brands in Public.

Before we get started I have to tell you a little about Twitter’s rate limiting policy. Unfortunately, the search area of Twitter’s API doesn’t have a published rate limit. Rather, Twitter says they allow a rate quite a bit higher than their standard 150 hits/hour, but they decline to say how much. Full documentation can be found here, about halfway down the page. I have run afoul of the limit before, and my guess is that it kicks in at around 600 hits per hour, or at more than 30 hits in a single minute. When you exceed the unpublished rate, you have to wait 1-3 hours before your IP address is allowed to hit Twitter again. If you’re just searching for someone’s posts, like we’re doing with Guy Kawasaki, you needn’t worry about getting anywhere near Twitter’s rate limit.

Ok, so now let’s get started.

Step 1:
After you figure out what you want to search for (this site is a good place to start looking for trends, and they graph them out for you), you’ll need to plug your search term into the URL string that your SAS program will use. If you’re searching for a person, like I did, your string will look like this:

http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100

The ‘q=’ parameter carries the search query, and the ‘from’ operator inside it tells Twitter that you’re searching for Tweets from a specific user. The ‘%3A’ is the URL encoding for a ‘:’. And the ‘&rpp’ parameter, set to 100, tells Twitter to return the maximum number of items per page. You can copy and paste that string into your browser right now and get back some nicely formatted XML representing Guy’s last 100 Tweets.
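
If you’d rather not hand-encode the search term, here is a minimal sketch of how the same string could be assembled inside SAS. It is only an illustration, not part of the published program: the variable names (query, perpage, url) are made up for the example, and it leans on the URLENCODE function to escape the colon.

```sas
/* Sketch: assemble the search URL from a plain-text query.        */
/* The variable names (query, perpage, url) are illustrative only. */
data _null_;
  length query $ 100 url $ 500;
  query   = 'from:guykawasaki';   /* search operator plus screen name */
  perpage = 100;                  /* the maximum, per the post        */

  /* URLENCODE escapes reserved characters such as the colon;     */
  /* STRIP keeps trailing blanks from being encoded along with it */
  url = cats('http://search.twitter.com/search.atom?q=',
             urlencode(strip(query)),
             '&rpp=', perpage);

  put url=;                       /* write the result to the log */
run;
```

Note that the ‘&rpp=’ piece sits in single quotes so the macro processor leaves the ampersand alone. If you move the finished string into a macro variable, you’ll need to protect the ampersands there as well.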

Step 2:
Ok, you know what you’re searching for and how to format the URL string to get your results. But Twitter returns a paltry 100 results at a time. You’re a SAS user; you don’t work with 100-record data sets! You want more, so you wrap your code in a macro, step through Twitter’s page= parameter to get older results, and append each new page of results to your master data set. Twitter will generally let you pull down one week’s worth of search results. The code to do this is located here.
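
To give a feel for the shape of that macro, here is a rough sketch of a paging loop. Treat it as one possible arrangement under stated assumptions rather than the published program: the names getpages, maxpages, twitmap, page, and tweets are invented for the example, and it assumes a fileref called twitmap already points at an XML map that defines a table named ENTRY. It also swaps in a %DO %WHILE loop with a page cap, and defaults the row count to zero so that a failed read ends the loop instead of leaving the counter undefined.

```sas
/* Sketch only. Assumes fileref TWITMAP points at an XML map that */
/* defines a table called ENTRY. All names here are illustrative. */
%macro getpages(maxpages=15);
  %local pageno lastcnt base;
  /* %NRSTR keeps the macro processor away from the & and % signs */
  %let base    = %nrstr(http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100&page=);
  %let pageno  = 1;
  %let lastcnt = 1;

  /* start clean so a rerun does not append to old results */
  proc datasets lib=work nolist nowarn;
    delete tweets page;
  quit;

  %do %while (&lastcnt > 0 and &pageno <= &maxpages);
    filename twit url "&base&pageno";
    libname tf xml xmlfileref=twit xmlmap=twitmap;

    /* read one page of results */
    data page;
      set tf.entry;
    run;

    /* default to 0 so a failed read ends the loop cleanly */
    %let lastcnt = 0;
    %if %sysfunc(exist(work.page)) %then %do;
      data _null_;
        if 0 then set page nobs=n;   /* row count without reading rows */
        call symputx('lastcnt', n);
        stop;
      run;
    %end;

    /* keep the page only when it actually held records */
    %if &lastcnt > 0 %then %do;
      proc append base=tweets data=page force;
      run;
      proc datasets lib=work nolist nowarn;
        delete page;
      quit;
    %end;

    libname tf clear;
    filename twit clear;
    %let pageno = %eval(&pageno + 1);
  %end;
%mend getpages;

%getpages(maxpages=15)
```

The page cap is there purely as a safety net: if Twitter keeps returning records, the loop still stops after a fixed number of requests.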

That’s enough to get you started. You now have a SAS data set with lots of Twitter data, including text to mine, dates and times to trend out, and, hopefully, an interesting topic to help showcase your analytical prowess to your audience.

You can access the full code here.

Don’t forget to come back in about 2 weeks to read my post on how to wrangle and append other data from Twitter to your search dataset. Or, better yet, click here and get all of my posts in your inbox as soon as they’re published.

12 thoughts on “SAS and Twitter–how to harness SAS to grab data from Twitter in 2 easy steps”

  1. Hi there,
    Great article. However, trying your code at this link, http://bit.ly/b2ppJx, returns errors. I’m running SAS 9.2 and get the errors below:

    Do I need to have created an XML map first or something?

    Hope you can help.

    Cheers

    100 /*dataset to hold the Tweet content*/
    101 data guy;
    102 length deleteme $ 1;
    103 deleteme='Y';
    104 run;

    NOTE: Compression was disabled for data set WORK.GUY because compression overhead would increase the
    size of the data set.
    NOTE: The data set WORK.GUY has 1 observations and 1 variables.
    NOTE: DATA statement used (Total process time):
    real time 0.00 seconds
    cpu time 0.00 seconds

    105 options mprint merror;
    106 %macro getpage();
    107 %let pageno=1;
    108 %loopme:
    109 %let fullfeed=%nrstr(http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100&page=);
    110 /**enter your search term
    110! above**/
    111
    112 filename twit URL "&fullfeed&pageno";
    113 libname tf XML xmlfileref=twit xmlmap=twitfile;
    114
    115 data guy;
    116 set tf.entry guy;
    117 if deleteme='Y' then delete;
    118 *get rid of the bugus record;
    119 run;
    120
    121 data countme;
    122 set tf.entry;
    123 run;
    124
    125 proc contents data=countme noprint out=count;
    126 run;
    127
    128 data _null_;
    129 set count;
    130 call symput('lastcnt',nobs);
    131 run;
    132 %let pageno=%eval(&pageno+1);
    133 %put &lastcnt;
    134 /**need to check how many records come back in the XML file from Twitter. When there’s 0 records,
    134! Twitter has gone back as far as it can and it won’t return any more data from
    135 the search. Typically, Twitter will release one week’s worth of search data**/
    136 %if %eval(&lastcnt) =0 %then %goto endme; %else %goto loopme;
    137 %endme:
    138 run;
    139 %mend;
    140 %getpage();
    MPRINT(GETPAGE): filename twit URL
    "http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100&page=1";
    MPRINT(GETPAGE): libname tf XML xmlfileref=twit xmlmap=twitfile;
    NOTE: Libref TF was successfully assigned as follows:
    Engine: XML
    Physical Name:
    MPRINT(GETPAGE): data guy;
    MPRINT(GETPAGE): set tf.entry guy;
    MPRINT(GETPAGE): * FROM ENTRY IN FILE '';
    ERROR: Physical file does not exist, C:\Documents and Settings\q743047\twitfile.
    encountered during XMLMap parsing
    occurred at or near line 1, column 1
    ERROR: XML describe error: Internal processing error.
    MPRINT(GETPAGE): if deleteme='Y' then delete;
    MPRINT(GETPAGE): *get rid of the bugus record;
    MPRINT(GETPAGE): run;

    NOTE: Compression was disabled for data set WORK.GUY because compression overhead would increase the
    size of the data set.
    NOTE: The SAS System stopped processing this step because of errors.
    WARNING: The data set WORK.GUY may be incomplete. When this step was stopped there were 0
    observations and 1 variables.
    WARNING: Data set WORK.GUY was not replaced because this step was stopped.
    NOTE: DATA statement used (Total process time):
    real time 0.00 seconds
    cpu time 0.00 seconds

    MPRINT(GETPAGE): data countme;
    MPRINT(GETPAGE): set tf.entry;
    MPRINT(GETPAGE): * FROM ENTRY IN FILE '';
    ERROR: Physical file does not exist, C:\Documents and Settings\q743047\twitfile.
    encountered during XMLMap parsing
    occurred at or near line 1, column 1
    ERROR: XML describe error: Internal processing error.
    MPRINT(GETPAGE): run;
    NOTE: Compression was disabled for data set WORK.COUNTME because compression overhead would increase
    the size of the data set.
    NOTE: The SAS System stopped processing this step because of errors.
    WARNING: The data set WORK.COUNTME may be incomplete. When this step was stopped there were 0
    observations and 0 variables.
    WARNING: Data set WORK.COUNTME was not replaced because this step was stopped.
    NOTE: DATA statement used (Total process time):
    real time 0.01 seconds
    cpu time 0.01 seconds

    MPRINT(GETPAGE): proc contents data=countme noprint out=count;
    MPRINT(GETPAGE): run;

    NOTE: The data set WORK.COUNT has 0 observations and 40 variables.
    NOTE: PROCEDURE CONTENTS used (Total process time):
    real time 0.01 seconds
    cpu time 0.00 seconds

    MPRINT(GETPAGE): data _null_;
    MPRINT(GETPAGE): set count;
    MPRINT(GETPAGE): call symput('lastcnt',nobs);
    MPRINT(GETPAGE): run;
    NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
    249:148
    NOTE: There were 0 observations read from the data set WORK.COUNT.
    NOTE: DATA statement used (Total process time):
    real time 0.00 seconds
    cpu time 0.00 seconds

    WARNING: Apparent symbolic reference LASTCNT not resolved.
    &lastcnt
    WARNING: Apparent symbolic reference LASTCNT not resolved.
    WARNING: Apparent symbolic reference LASTCNT not resolved.
    ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand
    is required. The condition was: &lastcnt
    ERROR: The macro GETPAGE will stop executing.

  2. @colm
    The code you ran is just one section of the overall code that I published at http://bit.ly/9fOsLl. The code at that link includes the XML map that you suspected you’d need. So you’re correct: SAS needs that XML map to show it how to parse the XML file that the URL string returns. You should be able to run the full code, verbatim, and get back approximately the last 1,000 tweets, including repeats, from Guy Kawasaki’s Twitter feed.
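
    For anyone curious what such a map involves, here is a stripped-down sketch. It is not the map from the full program: the column names, lengths, and paths are assumptions based on the general layout of an Atom feed, and a real map would carry more columns. The idea is simply to write the map to a file and assign the twitfile fileref before the libname statement runs, which is the piece that was missing in the log above.

```sas
/* Sketch: write a small XML map to a temporary file, then point the  */
/* XML libname at it. Column names, lengths, and paths are assumed    */
/* from the Atom layout; they are not taken from the full program.    */
filename twitfile temp;

data _null_;
  file twitfile;
  put '<?xml version="1.0" encoding="UTF-8"?>';
  put '<SXLEMAP version="1.2" name="twitfile">';
  put '  <TABLE name="entry">';
  put '    <TABLE-PATH syntax="XPath">/feed/entry</TABLE-PATH>';
  put '    <COLUMN name="published">';
  put '      <PATH syntax="XPath">/feed/entry/published</PATH>';
  put '      <TYPE>character</TYPE>';
  put '      <DATATYPE>string</DATATYPE>';
  put '      <LENGTH>32</LENGTH>';
  put '    </COLUMN>';
  put '    <COLUMN name="title">';
  put '      <PATH syntax="XPath">/feed/entry/title</PATH>';
  put '      <TYPE>character</TYPE>';
  put '      <DATATYPE>string</DATATYPE>';
  put '      <LENGTH>200</LENGTH>';
  put '    </COLUMN>';
  put '  </TABLE>';
  put '</SXLEMAP>';
run;

/* with the fileref assigned, XMLMAP= no longer looks for a */
/* physical file named twitfile in the current directory    */
filename twit url 'http://search.twitter.com/search.atom?q=from%3Aguykawasaki&rpp=100';
libname tf xml xmlfileref=twit xmlmap=twitfile;
```
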
    If you’re interested in going back a lot further, come back next week, or sign up for my newsletter here and learn how to do that and a few other tricks.

    Thanks,

    John C. Munoz

  3. Hey John,

    This is really cool stuff. I am a student and I find the data collected from Twitter useful for many types of analysis. Can we do the same kind of thing to fetch Facebook status updates and their comments? Is that feasible? I was browsing the Facebook status update APIs and I don’t have any clue how to use them. Please share your thoughts.

    Thanks
    Satish

    1. Satish,

      I don’t think Facebook status updates are accessible unless you’re friends with the person whose updates you’re trying to access. So that makes FB not nearly as much fun to play with as Twitter.

      Thanks for your comment. I hope to be posting a new SAS/Twitter blog entry shortly, so stay tuned!

  4. Hi John,

    Thanks for coming back to me. I just spotted that this morning as well. I have tried the full code but am getting the error below.

    Not sure if this has anything to do with recent changes to the Twitter API? Any ideas?

    Cheers

    Colm

    NOTE: Unable to connect to host search.twitter.com. Check validity of host name.
    ERROR: Hostname search.twitter.com not found.
    encountered during XMLInput parsing

    1. Hi Colm,

      I ran the code verbatim and didn’t receive any errors. However, I also didn’t get back any records, which I know isn’t right because Guy Kawasaki is still out there tweeting away. I tested the URL string and searched for other users, and did get back results. So that’s pretty weird. Perhaps there’s something new in the API which is limiting search results for specific users, or for users who tweet a lot. Who knows? In any case, try changing guykawasaki to johncmunoz and see what happens. You should get back at least 1 record in your dataset.

  5. Hey John,
    I tried changing the user etc. but still get the following error:

    ERROR: Hostname search.twitter.com not found.
    encountered during XMLInput parsing
    occurred at or near line 1, column 1

    Yet when I use the host name in the browser address bar it’s fine and returns the XML RSS feed etc.

    Are there any other ways to just get my tweets into a dataset?

    Cheers

    Colm

    1. Hi Colm,

      I’m sorry to hear about the trouble you’re having. I ran the full code verbatim again, and it worked without any errors. Perhaps the problem is with SAS. What version are you running? This is a wild guess, but perhaps your version doesn’t support the XML libname functionality that my code utilizes. I think SAS came out with the XML libname engine in 9.1.

      If I were you I’d give SAS’ support a call. They can almost always get to the bottom of something like this. Their U.S. number is 919-677-8008. Provide them with your site number and tell them you need help with the xml libname statement.

      There are a couple of other ways to access Twitter with SAS. Google proc soap and proc http. Also, there’s a good SAS blog post at http://blogs.sas.com/supportnews/index.php?/archives/72-Using-SAS-to-call-Twitter.html that goes into detail on proc http.

      Good luck! Let us know how it turns out.

  6. Hi John,

    I tried your code but encountered the following error during the data steps (e.g. data countme; set tf.entry; run;). Do you happen to know why? I am using server SAS 9.1.3. Thank you!

    – Patrick

    ERROR: Unable to init socket API.
    ERROR: Invalid open mode.
    encountered during XMLInput parsing
    occurred at or near line 1, column 1

  7. John, Hi
    Just found your post and tried your code, it works great.
    When I first tried it, I changed the search term and made the mistake of erasing part of the code, as shown here:
    Right:
    from%3Aguykawasaki
    Wrong:
    from%3guykawasaki
    Great post, thank you
